Project 4: Exploratory Data Analysis

Explore and Summarise Data

Data Analyst Nanodegree (Udacity)

Project submission by Edward Minnett (ed@methodic.io).

February 19th 2017 (Revision 2)


Data Source

This report is a ‘stream of consciousness’ exploration of a pair of data sets. One data set represents the physical characteristics and perceived quality of white wine while the other describes the same features for red wine. The data is provided by Cortez et al. as part of their 2009 paper Modeling wine preferences by data mining from physicochemical properties, published by Elsevier.

In the authors’ own words:

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

A more detailed description of the data set can be found in the README.md file that accompanies this report.

Univariate Plots Section

Initial Summary Exploration

When exploring data for the first time, it helps to get a very high level view of the whole data set in the hope of getting an idea where to zoom in and explore in more detail.

To begin with, what is the size and shape of the data set? This summary includes the head of the data frame.

White Wine

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Red Wine

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data for both red and white wines are described by 12 variables. There are 4898 observations for the white wine, but only 1599 observations for the red wine.
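
As a minimal sketch of how a structure summary like the ones above can be produced, the following reads a tiny stand-in table and inspects it with str() and dim(). The semicolon-separated layout and the three-row sample (copied from the white wine output above) are assumptions for illustration; the real files and their paths are not shown in this report.

```r
# Hypothetical stand-in for the white wine file: three rows copied from the
# str() output above, in a semicolon-separated layout (an assumption).
csv_text <- "fixed acidity;volatile acidity;quality
7;0.27;6
6.3;0.3;6
8.1;0.28;6"

# read.csv() turns the spaces in the header into dots (fixed.acidity, ...).
white_sample <- read.csv(text = csv_text, sep = ";")

str(white_sample)  # column types and leading values
dim(white_sample)  # number of rows, number of columns
```

With the full files, the same two calls would report the observation and variable counts quoted above.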

What are the summary statistics for each feature?

White Wine

##                      vars    n   mean    sd median trimmed   mad  min    max  range skew kurtosis   se
## fixed.acidity           1 4898   6.85  0.84   6.80    6.82  0.74 3.80  14.20  10.40 0.65     2.17 0.01
## volatile.acidity        2 4898   0.28  0.10   0.26    0.27  0.09 0.08   1.10   1.02 1.58     5.08 0.00
## citric.acid             3 4898   0.33  0.12   0.32    0.33  0.09 0.00   1.66   1.66 1.28     6.16 0.00
## residual.sugar          4 4898   6.39  5.07   5.20    5.80  5.34 0.60  65.80  65.20 1.08     3.46 0.07
## chlorides               5 4898   0.05  0.02   0.04    0.04  0.01 0.01   0.35   0.34 5.02    37.51 0.00
## free.sulfur.dioxide     6 4898  35.31 17.01  34.00   34.36 16.31 2.00 289.00 287.00 1.41    11.45 0.24
## total.sulfur.dioxide    7 4898 138.36 42.50 134.00  136.96 43.00 9.00 440.00 431.00 0.39     0.57 0.61
## density                 8 4898   0.99  0.00   0.99    0.99  0.00 0.99   1.04   0.05 0.98     9.78 0.00
## pH                      9 4898   3.19  0.15   3.18    3.18  0.15 2.72   3.82   1.10 0.46     0.53 0.00
## sulphates              10 4898   0.49  0.11   0.47    0.48  0.10 0.22   1.08   0.86 0.98     1.59 0.00
## alcohol                11 4898  10.51  1.23  10.40   10.43  1.48 8.00  14.20   6.20 0.49    -0.70 0.02
## quality                12 4898   5.88  0.89   6.00    5.85  1.48 3.00   9.00   6.00 0.16     0.21 0.01

Red Wine

##                      vars    n  mean    sd median trimmed   mad  min    max  range skew kurtosis   se
## fixed.acidity           1 1599  8.32  1.74   7.90    8.15  1.48 4.60  15.90  11.30 0.98     1.12 0.04
## volatile.acidity        2 1599  0.53  0.18   0.52    0.52  0.18 0.12   1.58   1.46 0.67     1.21 0.00
## citric.acid             3 1599  0.27  0.19   0.26    0.26  0.25 0.00   1.00   1.00 0.32    -0.79 0.00
## residual.sugar          4 1599  2.54  1.41   2.20    2.26  0.44 0.90  15.50  14.60 4.53    28.49 0.04
## chlorides               5 1599  0.09  0.05   0.08    0.08  0.01 0.01   0.61   0.60 5.67    41.53 0.00
## free.sulfur.dioxide     6 1599 15.87 10.46  14.00   14.58 10.38 1.00  72.00  71.00 1.25     2.01 0.26
## total.sulfur.dioxide    7 1599 46.47 32.90  38.00   41.84 26.69 6.00 289.00 283.00 1.51     3.79 0.82
## density                 8 1599  1.00  0.00   1.00    1.00  0.00 0.99   1.00   0.01 0.07     0.92 0.00
## pH                      9 1599  3.31  0.15   3.31    3.31  0.15 2.74   4.01   1.27 0.19     0.80 0.00
## sulphates              10 1599  0.66  0.17   0.62    0.64  0.12 0.33   2.00   1.67 2.42    11.66 0.00
## alcohol                11 1599 10.42  1.07  10.20   10.31  1.04 8.40  14.90   6.50 0.86     0.19 0.03
## quality                12 1599  5.64  0.81   6.00    5.59  1.48 3.00   8.00   5.00 0.22     0.29 0.02

Before plotting the features, it is worth doing an analysis to see if there are any obvious outliers that affect the data as a whole. For this analysis, I will be using Cook’s distance based on a linear model for the quality feature for each of the two types of wine. Any point with a Cook’s distance greater than 1 will be treated as exerting disproportionate influence on the model.
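
The outlier check described above can be sketched as follows. This is an illustration on synthetic data rather than the code behind this report: the data frame, its two predictors and the planted extreme record are all invented, and only the technique (a linear model for quality, Cook’s distance, a threshold of 1) comes from the text.

```r
set.seed(42)
n <- 200

# Synthetic stand-in data; column names mirror the report, values do not.
wine <- data.frame(
  alcohol        = rnorm(n, mean = 10.5, sd = 1.2),
  residual.sugar = rexp(n, rate = 1 / 6)
)
wine$quality <- 3 + 0.3 * wine$alcohol - 0.02 * wine$residual.sugar +
  rnorm(n, sd = 0.7)

# Plant one extreme, high-leverage record so that there is something to find.
wine$residual.sugar[1] <- 400
wine$quality[1] <- 9

# Linear model of quality on every other column, then Cook's distance per row.
fit   <- lm(quality ~ ., data = wine)
cooks <- cooks.distance(fit)

# Rows whose influence exceeds the threshold of 1 are dropped.
flagged    <- which(cooks > 1)
wine_clean <- if (length(flagged) > 0) wine[-flagged, ] else wine
```

A threshold of 1 is a deliberately conservative cut-off, so it only catches records that noticeably shift the fitted model on their own.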

White Wine

Red Wine

This leaves us with a single outlier within the white wine data. From this point on, this datum will be excluded from the analysis. Any outliers that are present for specific features of the data may be excluded for individual plots (and will be mentioned if this is the case), but those outliers will not be excluded from the dataset when discussing other features.

White Wines

Now let’s take a look at the general distribution for each of the white wine features. As this is the first time we are looking at each of these distributions, there isn’t a specific justification for each plot apart from the fact that we want to see how that feature is distributed.

## [1] "Summary statistics for White Wine: Fixed Acidity"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4897 6.85 0.84    6.8    6.82 0.74 3.8 14.2  10.4 0.65     2.17 0.01
## [1] "Quantiles for White Wine: Fixed Acidity"
##   0%  25%  50%  75% 100% 
##  3.8  6.3  6.8  7.3 14.2 
## [1] "Interquartile Range for White Wine: Fixed Acidity"
## [1] 1

The majority of the fixed acidity data for white wine is reasonably symmetrically distributed around the mean, with the exception of a few outliers to the right of the distribution.
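
Each feature in this section is reported with the same three pieces of output: summary statistics, quantiles and the interquartile range. A minimal base R sketch of such a per-feature summary is shown below. The helper name summarise_feature is hypothetical, and the ten input values are simply the leading fixed acidity values from the white wine structure output, so the numbers will not match the full-data tables (which also include trimmed mean, skew and kurtosis columns).

```r
# Hypothetical helper: gathers the basic statistics for one numeric feature.
summarise_feature <- function(x, label) {
  list(
    label     = label,
    n         = length(x),
    mean      = mean(x),
    sd        = sd(x),
    median    = median(x),
    quantiles = quantile(x),  # 0%, 25%, 50%, 75%, 100%
    iqr       = IQR(x)        # 75th percentile minus 25th percentile
  )
}

# First ten fixed acidity values shown in the white wine str() output.
fixed_acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)

s <- summarise_feature(fixed_acidity, "White Wine: Fixed Acidity")
print(s$quantiles)
print(s$iqr)
```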

## [1] "Summary statistics for White Wine: Volatile Acidity"
##    vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## X1    1 4897 0.28 0.1   0.26    0.27 0.09 0.08 1.1  1.02 1.54      4.8  0
## [1] "Quantiles for White Wine: Volatile Acidity"
##   0%  25%  50%  75% 100% 
## 0.08 0.21 0.26 0.32 1.10 
## [1] "Interquartile Range for White Wine: Volatile Acidity"
## [1] 0.11

Volatile acidity for white wine has a right-skewed distribution with quite a few outliers beyond the right tail.

## [1] "Summary statistics for White Wine: Citric Acid"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 4897 0.33 0.12   0.32    0.33 0.09   0 1.66  1.66 1.28     6.18  0
## [1] "Quantiles for White Wine: Citric Acid"
##   0%  25%  50%  75% 100% 
## 0.00 0.27 0.32 0.39 1.66 
## [1] "Interquartile Range for White Wine: Citric Acid"
## [1] 0.12

Citric acid for white wine has quite a narrow peak and is quite symmetrically distributed around the mean with only a few distant outliers beyond the right tail of the distribution.

## [1] "Summary statistics for White Wine: Residual Sugar"
##    vars    n mean sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4897 6.38  5    5.2     5.8 5.34 0.6 31.6    31 0.79    -0.22 0.07
## [1] "Quantiles for White Wine: Residual Sugar"
##   0%  25%  50%  75% 100% 
##  0.6  1.7  5.2  9.9 31.6 
## [1] "Interquartile Range for White Wine: Residual Sugar"
## [1] 8.2

Residual sugar for white wine is very skewed to the right. There is a very narrow peak close to 0 with a long right-hand tail and a few outliers beyond the tail.

## [1] "Summary statistics for White Wine: Chlorides"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 4897 0.05 0.02   0.04    0.04 0.01 0.01 0.35  0.34 5.02    37.53  0
## [1] "Quantiles for White Wine: Chlorides"
##    0%   25%   50%   75%  100% 
## 0.009 0.036 0.043 0.050 0.346 
## [1] "Interquartile Range for White Wine: Chlorides"
## [1] 0.014

Chlorides for white wine has a very sharp peak just to the left of the mean with very flat tails. The right-hand tail is very long with quite a few outliers beyond the tail.

## [1] "Summary statistics for White Wine: Free Sulfur Dioxide"
##    vars    n  mean sd median trimmed   mad min max range skew kurtosis   se
## X1    1 4897 35.31 17     34   34.36 16.31   2 289   287 1.41    11.46 0.24
## [1] "Quantiles for White Wine: Free Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    2   23   34   46  289 
## [1] "Interquartile Range for White Wine: Free Sulfur Dioxide"
## [1] 23

Free sulfur dioxide for white wine is quite symmetrically distributed around the mean with a slightly longer tail to the right. There are a few outliers close to the right-hand tail and one quite far beyond it.

## [1] "Summary statistics for White Wine: Total Sulfur Dioxide"
##    vars    n   mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 4897 138.36 42.5    134  136.95  43   9 440   431 0.39     0.57 0.61
## [1] "Quantiles for White Wine: Total Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    9  108  134  167  440 
## [1] "Interquartile Range for White Wine: Total Sulfur Dioxide"
## [1] 59

Total sulfur dioxide for white wine has quite a wide peak with a steep slope on the left side of the distribution and a shallower slope on the right-hand side. There are quite a few outliers beyond the right tail of the distribution.

## [1] "Summary statistics for White Wine: Density"
##    vars    n mean sd median trimmed mad  min  max range skew kurtosis se
## X1    1 4897 0.99  0   0.99    0.99   0 0.99 1.01  0.02 0.31     -0.4  0
## [1] "Quantiles for White Wine: Density"
##      0%     25%     50%     75%    100% 
## 0.98711 0.99172 0.99374 0.99610 1.01030 
## [1] "Interquartile Range for White Wine: Density"
## [1] 0.00438

## [1] "Summary statistics for White Wine: pH"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 4897 3.19 0.15   3.18    3.18 0.15 2.72 3.82   1.1 0.46     0.53  0
## [1] "Quantiles for White Wine: pH"
##   0%  25%  50%  75% 100% 
## 2.72 3.09 3.18 3.28 3.82 
## [1] "Interquartile Range for White Wine: pH"
## [1] 0.19

## [1] "Summary statistics for White Wine: Sulphates"
##    vars    n mean   sd median trimmed mad  min  max range skew kurtosis se
## X1    1 4897 0.49 0.11   0.47    0.48 0.1 0.22 1.08  0.86 0.98     1.59  0
## [1] "Quantiles for White Wine: Sulphates"
##   0%  25%  50%  75% 100% 
## 0.22 0.41 0.47 0.55 1.08 
## [1] "Interquartile Range for White Wine: Sulphates"
## [1] 0.14

## [1] "Summary statistics for White Wine: Alcohol"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 4897 10.51 1.23   10.4   10.43 1.48   8 14.2   6.2 0.49     -0.7 0.02
## [1] "Quantiles for White Wine: Alcohol"
##   0%  25%  50%  75% 100% 
##  8.0  9.5 10.4 11.4 14.2 
## [1] "Interquartile Range for White Wine: Alcohol"
## [1] 1.9

## [1] "Summary statistics for White Wine: Quality"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 4897 5.88 0.89      6    5.85 1.48   3   9     6 0.16     0.21 0.01
## [1] "Quantiles for White Wine: Quality"
##   0%  25%  50%  75% 100% 
##    3    5    6    6    9 
## [1] "Interquartile Range for White Wine: Quality"
## [1] 1

Let’s also take a look at the kernel density estimate for the quality feature.
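
Because quality only takes integer values, a kernel density estimate with a narrow bandwidth produces one bump per score. The sketch below shows one way to smooth over that using the adjust argument of density(). The scores are synthetic and the sampling weights are invented; the bandwidth used for the plots in this report is not stated, so adjust = 2 is an assumption.

```r
set.seed(1)

# Synthetic integer scores; the sampling weights are invented for illustration.
quality <- sample(3:9, size = 500, replace = TRUE,
                  prob = c(0.01, 0.03, 0.30, 0.45, 0.17, 0.035, 0.005))

# Gaussian kernel density estimate; adjust > 1 widens the default bandwidth,
# which smooths over the gaps between integer scores.
kde <- density(quality, adjust = 2)

plot(kde, main = "Kernel density estimate of quality (synthetic data)",
     xlab = "quality")
```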

Red Wines

Now let’s take a look at the general distribution for each of the red wine features. As this is the first time we are looking at each of these distributions, there isn’t a specific justification for each plot apart from the fact that we want to see how that feature is distributed.

## [1] "Summary statistics for Red Wine: Fixed Acidity"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1599 8.32 1.74    7.9    8.15 1.48 4.6 15.9  11.3 0.98     1.12 0.04
## [1] "Quantiles for Red Wine: Fixed Acidity"
##   0%  25%  50%  75% 100% 
##  4.6  7.1  7.9  9.2 15.9 
## [1] "Interquartile Range for Red Wine: Fixed Acidity"
## [1] 2.1

## [1] "Summary statistics for Red Wine: Volatile Acidity"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1599 0.53 0.18   0.52    0.52 0.18 0.12 1.58  1.46 0.67     1.21  0
## [1] "Quantiles for Red Wine: Volatile Acidity"
##   0%  25%  50%  75% 100% 
## 0.12 0.39 0.52 0.64 1.58 
## [1] "Interquartile Range for Red Wine: Volatile Acidity"
## [1] 0.25

## [1] "Summary statistics for Red Wine: Citric Acid"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis se
## X1    1 1599 0.27 0.19   0.26    0.26 0.25   0   1     1 0.32    -0.79  0
## [1] "Quantiles for Red Wine: Citric Acid"
##   0%  25%  50%  75% 100% 
## 0.00 0.09 0.26 0.42 1.00 
## [1] "Interquartile Range for Red Wine: Citric Acid"
## [1] 0.33

Let’s take a closer look at the citric acid for red wines and see if we can clarify the shape of the distribution.

It appears that citric acid for red wine is somewhat uniformly distributed with a few peaks at 0, 0.2, 0.24, and 0.49. The values begin to tail off above 0.5 without any values between 0.79 and an outlier at 1.0.
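
The effect of narrowing the bin width can be sketched with hist(). The values below are synthetic (uniform on a 0.01 grid up to 0.79, plus one value at 1.0) and only loosely imitate the shape described above. The bin widths of 0.1 and 0.01 are assumptions, as the widths used for the plots in this report are not stated.

```r
set.seed(7)

# Synthetic values on a 0.01 grid between 0 and 0.79 plus one value at 1.0.
citric <- c(round(runif(300, min = 0, max = 0.79), 2), 1.0)

# Coarse bins hide the detail; fine bins centred on the 0.01 grid reveal it.
coarse <- hist(citric, breaks = seq(0, 1, by = 0.1), plot = FALSE)
fine   <- hist(citric, breaks = seq(-0.005, 1.005, by = 0.01), plot = FALSE)

# Bins with no observations show up as gaps, such as the stretch before 1.0.
empty_bins <- fine$mids[fine$counts == 0]

plot(fine, main = "Citric acid, bin width 0.01 (synthetic data)",
     xlab = "citric acid (g / dm^3)")
```

Centring the fine bins on the measurement grid keeps each recorded value away from a bin edge, so no count depends on how ties at the boundary are resolved.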

## [1] "Summary statistics for Red Wine: Residual Sugar"
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1599 2.54 1.41    2.2    2.26 0.44 0.9 15.5  14.6 4.53    28.49 0.04
## [1] "Quantiles for Red Wine: Residual Sugar"
##   0%  25%  50%  75% 100% 
##  0.9  1.9  2.2  2.6 15.5 
## [1] "Interquartile Range for Red Wine: Residual Sugar"
## [1] 0.7

## [1] "Summary statistics for Red Wine: Chlorides"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1599 0.09 0.05   0.08    0.08 0.01 0.01 0.61   0.6 5.67    41.53  0
## [1] "Quantiles for Red Wine: Chlorides"
##    0%   25%   50%   75%  100% 
## 0.012 0.070 0.079 0.090 0.611 
## [1] "Interquartile Range for Red Wine: Chlorides"
## [1] 0.02

## [1] "Summary statistics for Red Wine: Free Sulfur Dioxide"
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 1599 15.87 10.46     14   14.58 10.38   1  72    71 1.25     2.01 0.26
## [1] "Quantiles for Red Wine: Free Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    1    7   14   21   72 
## [1] "Interquartile Range for Red Wine: Free Sulfur Dioxide"
## [1] 14

## [1] "Summary statistics for Red Wine: Total Sulfur Dioxide"
##    vars    n  mean   sd median trimmed   mad min max range skew kurtosis   se
## X1    1 1599 46.47 32.9     38   41.84 26.69   6 289   283 1.51     3.79 0.82
## [1] "Quantiles for Red Wine: Total Sulfur Dioxide"
##   0%  25%  50%  75% 100% 
##    6   22   38   62  289 
## [1] "Interquartile Range for Red Wine: Total Sulfur Dioxide"
## [1] 40

## [1] "Summary statistics for Red Wine: Density"
##    vars    n mean sd median trimmed mad  min max range skew kurtosis se
## X1    1 1599    1  0      1       1   0 0.99   1  0.01 0.07     0.92  0
## [1] "Quantiles for Red Wine: Density"
##       0%      25%      50%      75%     100% 
## 0.990070 0.995600 0.996750 0.997835 1.003690 
## [1] "Interquartile Range for Red Wine: Density"
## [1] 0.002235

## [1] "Summary statistics for Red Wine: pH"
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 1599 3.31 0.15   3.31    3.31 0.15 2.74 4.01  1.27 0.19      0.8  0
## [1] "Quantiles for Red Wine: pH"
##   0%  25%  50%  75% 100% 
## 2.74 3.21 3.31 3.40 4.01 
## [1] "Interquartile Range for Red Wine: pH"
## [1] 0.19

## [1] "Summary statistics for Red Wine: Sulphates"
##    vars    n mean   sd median trimmed  mad  min max range skew kurtosis se
## X1    1 1599 0.66 0.17   0.62    0.64 0.12 0.33   2  1.67 2.42    11.66  0
## [1] "Quantiles for Red Wine: Sulphates"
##   0%  25%  50%  75% 100% 
## 0.33 0.55 0.62 0.73 2.00 
## [1] "Interquartile Range for Red Wine: Sulphates"
## [1] 0.18

## [1] "Summary statistics for Red Wine: Alcohol"
##    vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 1599 10.42 1.07   10.2   10.31 1.04 8.4 14.9   6.5 0.86     0.19 0.03
## [1] "Quantiles for Red Wine: Alcohol"
##   0%  25%  50%  75% 100% 
##  8.4  9.5 10.2 11.1 14.9 
## [1] "Interquartile Range for Red Wine: Alcohol"
## [1] 1.6

## [1] "Summary statistics for Red Wine: Quality"
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1599 5.64 0.81      6    5.59 1.48   3   8     5 0.22     0.29 0.02
## [1] "Quantiles for Red Wine: Quality"
##   0%  25%  50%  75% 100% 
##    3    5    6    6    8 
## [1] "Interquartile Range for Red Wine: Quality"
## [1] 1

Let’s also take a look at the kernel density estimate for the quality feature.

Univariate Analysis

What is the structure of your dataset?

The two datasets contain a total of 6497 observations with 4898 for white wines and 1599 for reds. Each observation is described by 12 variables: (this description of the variables comes from the data description authored by Cortez et al.)

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)
  12. quality (score between 0 and 10). This is the output variable (based on sensory data).

All 12 variables are numeric. Only quality is stored as an integer; the other 11 variables are stored as floating point numbers, although free sulfur dioxide and total sulfur dioxide take whole-number values in the rows shown above.

The histograms for each of the 12 features for both data sets give us a good indication of the distributions. Nearly all of the histograms are skewed to the right with more outliers in the right-hand tails. The notable exceptions are pH, which appears to be reasonably normally distributed, and density for red wine. The quality histograms immediately stand out because that feature only contains integer values, between 3 and 9 for white wines and between 3 and 8 for reds. The disparity between the number of observations for whites compared to reds becomes very clear. Some of the plots for the red wines are much less clearly defined because there are so many fewer observations. In particular, the citric acid plot for red wine has a distinct lack of structure in the distribution.

What is/are the main feature(s) of interest in your dataset?

There isn’t a particular feature of interest in the data that stands out. What I am interested in finding out is whether there are any distinct differences in the physical characteristics between white and red wine. Just as importantly, I would like to know if there is a strong correlation between any of the physical characteristics of the wine and the perceived quality of that wine. If there are, are these physical qualities different for white and red wines?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Only further investigation will determine if this is true, but I have a feeling that extreme values in the physical characteristics of a wine will negatively impact the quality. This is merely a conjecture, but I think if the acidity is too low or too high or the sulfur dioxide is too low or too high, this is likely to lead to particularly low scoring wines. If this is true, I imagine that wines that tend to reside near the peaks for each of the physical characteristics will have above average quality scores.

Did you create any new variables from existing variables in the dataset?

No. I couldn’t think of a characteristic of the data that needed to be described by a new variable composed of the existing variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The data is already tidy, so further tidying wasn’t needed.

I performed outlier analysis using Cook’s distance based on a linear model for the quality feature for each of the two types of wine with a threshold of 1. This analysis found a single outlier in the white wine data. It has been removed for all subsequent analysis. Some of the individual features have clear outliers in their distributions, but because the Cook’s distance didn’t flag them as having disproportional influence on the data as a whole, these records are unlikely to be outliers for more than a few of the features. Removing them from the data would be a mistake. Instead, these feature specific outliers may be excluded from individual plots when doing further analysis, but remain within the data as a whole.

Of all 24 distributions, only the citric acid observations for red wine required further analysis. This is primarily because the distribution wasn’t clear when plotted with the same bin width as the other histograms. Even when the bin width was decreased to get a better sense of the distribution’s shape, the distribution lacked a coherent modal or even multi-modal shape.

Bivariate Plots Section

Correlations

White Wine Feature Correlations

The White Wine features don’t appear to be very highly correlated with each other. There are only three pairs of features with a Pearson product-moment correlation coefficient (r) score greater than 0.5 and only one pair with a score less than -0.5.

  • Density / Residual Sugar: 0.83
  • Total Sulfur Dioxide / Free Sulfur Dioxide: 0.62
  • Density / Total Sulfur Dioxide: 0.54
  • Density / Alcohol: -0.8
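
A list like the one above can be extracted from a correlation matrix as sketched below. The data frame is synthetic, with a density column constructed to depend on sugar and alcohol, so the resulting coefficients are illustrative only and are not the values reported here.

```r
set.seed(3)
n <- 500

# Synthetic stand-in data: density is built from sugar and alcohol plus noise.
sugar   <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
wine <- data.frame(
  residual.sugar = sugar,
  alcohol        = alcohol,
  density        = 0.994 + 0.0004 * sugar - 0.002 * (alcohol - 10.5) +
    rnorm(n, sd = 0.0005),
  pH             = rnorm(n, mean = 3.2, sd = 0.15)
)

# Pearson correlation matrix for every pair of columns.
r <- cor(wine, method = "pearson")

# Keep each pair once (upper triangle), then filter on |r| > 0.5.
idx <- which(upper.tri(r) & abs(r) > 0.5, arr.ind = TRUE)
strong <- data.frame(
  feature_a = rownames(r)[idx[, "row"]],
  feature_b = colnames(r)[idx[, "col"]],
  r         = round(r[idx], 2)
)

# Order from the strongest positive to the strongest negative correlation.
strong <- strong[order(-strong$r), ]
print(strong)
```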

Red Wine Feature Correlations

The Red Wine features don’t appear to be very highly correlated with each other either. Like the White Wines, there are only three pairs of features with a Pearson product-moment correlation coefficient (r) score greater than 0.5, but there are four pairs with a score less than or equal to -0.5. Interestingly, there are only two that overlap with the White Wines.

  • Density / Fixed Acidity: 0.67
  • Citric Acid / Fixed Acidity: 0.67
  • Total Sulfur Dioxide / Free Sulfur Dioxide: 0.67
  • Density / Alcohol: -0.5
  • Citric Acid / pH: -0.54
  • Citric Acid / Volatile Acidity: -0.55
  • pH / Fixed Acidity: -0.68

The largest correlation scores with the quality of each of the types of wine are with the alcohol content (r 0.44 for White Wine and r 0.48 for Red Wine). These aren’t particularly large scores and likely say more about some reviewers’ preference for stronger drinks than about the wine itself.

White Wines

What do the four strongest relationships in the white wines data look like (from the strongest positive correlation to the strongest negative)?

Density / Residual Sugar: r 0.83

Total Sulfur Dioxide / Free Sulfur Dioxide: r 0.62

Density / Total Sulfur Dioxide: r 0.54

Density / Alcohol: r -0.8

Red Wines

What do the seven strongest relationships in the red wines data look like (from the strongest positive correlation to the strongest negative)?

Density / Fixed Acidity: r 0.67

Citric Acid / Fixed Acidity: r 0.67

Total Sulfur Dioxide / Free Sulfur Dioxide: r 0.67

Density / Alcohol: r -0.5

Citric Acid / pH: r -0.54

Citric Acid / Volatile Acidity: r -0.55

pH / Fixed Acidity: r -0.68

Quality

We have seen that alcohol has the strongest relationship with quality for both types of wine. Let’s take a look at what these two distributions look like.

From this plot, we can see that alcohol has a very slight negative relationship with the quality score below a score of 5, but it has a much stronger positive relationship for scores greater than 5. The score with the most outliers is 5 with all of the outliers representing alcohol levels above the interquartile range.

The relationship between alcohol and quality for red wine is very similar to that of white wine following almost the same trends. The main difference, as has been noted previously, is that there are wines with a quality score of 9 for white wines but none for red wines.
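
The comparison of alcohol across quality scores, including the count of outliers per score, can be sketched with boxplot(). The data below is synthetic and constructed so that alcohol rises for scores above 5. The assumption here is that the plots discussed above are box plots of alcohol grouped by quality, which the references to outliers and the interquartile range suggest but do not confirm.

```r
set.seed(11)
n <- 400

# Synthetic scores and alcohol levels; alcohol only rises for scores above 5.
quality <- sample(3:8, size = n, replace = TRUE,
                  prob = c(0.02, 0.05, 0.40, 0.38, 0.13, 0.02))
alcohol <- 9 + 0.4 * pmax(quality - 5, 0) + rexp(n, rate = 1.5)
wine <- data.frame(quality = factor(quality), alcohol = alcohol)

# One box per quality score; points beyond the whiskers are drawn as outliers.
bp <- boxplot(alcohol ~ quality, data = wine,
              xlab = "quality", ylab = "alcohol (% by volume)")

# bp$out holds the outlying values and bp$group the box each one belongs to.
outliers_per_score <- table(factor(bp$names[bp$group], levels = bp$names))
medians <- tapply(wine$alcohol, wine$quality, median)

print(outliers_per_score)
print(round(medians, 2))
```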

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Unfortunately, there doesn’t appear to be a strong relationship between the quality of the wine and any of its physical characteristics. The strongest of these relationships is the one between alcohol and quality, though the correlation r score is only 0.44 for white wine and 0.48 for red. This suggests that some of the reviewers may be biased toward perceiving a stronger wine as being of higher quality, though this isn’t something that can be confirmed with the data available (as we can’t determine which reviewer reviewed which wine). When tasting wine, the quantity of alcohol is one of the least subtle qualities to detect. It is possible that reviewers latched onto this quality to differentiate their preference for the different wines.

The fact that there isn’t a clear relationship between quality and any specific physical characteristic does tell us something about how wine is perceived. It is plausible that the physical character of the wine isn’t the primary predictor of quality. There are likely to be other confounding factors not captured by this data. These could include the colour, context, shape of the glass, or whether the wine was perceived to be cheap or expensive.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It isn’t surprising that the strongest bivariate relationships in the data follow physical relationships in the chemistry and fermentation of wine. The strong relationship between density and residual sugar suggests that the density of the wine is highly influenced by the measure of the residual sugar. Given that the fermentation process causes sugar to turn into alcohol, it makes sense that there is an inverse relationship that is nearly as strong between density and alcohol. There is also a reasonably strong relationship between total sulfur dioxide and free sulfur dioxide (r 0.62 for white wine and 0.67 for red wine). For red wine, there is a strong positive relationship between citric acid and fixed acidity but a moderately strong inverse relationship between citric acid and volatile acidity along with reasonably strong negative relationships between pH and citric acid and pH and fixed acidity. This makes sense as all of these chemical properties are associated with each other.

What is interesting is that the relationships between the acidic features are less prominent for white wine than they are for red wine.

What was the strongest relationship you found?

The strongest relationship I found was between density and residual sugar for white wine. These two features had the largest Pearson r score of 0.83. This was closely followed by a negative correlation between alcohol and density (also for white wine) with an r score of -0.8.

Multivariate Plots Section

We have established that the relationships between quality and the physical features of the wine aren’t very strong, but they do exist. Not only is there some correlation between the quality and physical features, but the nature of these relationships differs for red and white wines. We have also established that the strongest relationship is between alcohol and quality, so now let’s take a look at the next three strongest relationships for each of the two types of wine, plot them against alcohol and colour the plot by the quality for each datum.

Let’s see what we can find.
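
A plot of one feature against alcohol, coloured by quality, can be sketched in base R as follows. The data is synthetic and the colour ramp is an arbitrary choice; the plotting library and palette used for the plots in this report are not stated, so both are assumptions.

```r
set.seed(5)
n <- 300

# Synthetic stand-in data: density falls and quality rises with alcohol.
wine <- data.frame(alcohol = runif(n, min = 8, max = 14))
wine$density <- 1.002 - 0.0008 * wine$alcohol + rnorm(n, sd = 0.001)
raw_score    <- round(0.4 * wine$alcohol + 1.5 + rnorm(n, sd = 0.8))
wine$quality <- pmin(pmax(raw_score, 3), 9)  # clamp scores to 3..9

# One colour per quality score, from pale (low) to dark (high).
scores       <- sort(unique(wine$quality))
score_cols   <- colorRampPalette(c("lightblue", "darkblue"))(length(scores))
point_colour <- score_cols[match(wine$quality, scores)]

plot(wine$alcohol, wine$density, col = point_colour, pch = 19,
     xlab = "alcohol (% by volume)", ylab = "density (g / cm^3)",
     main = "Density vs alcohol by quality (synthetic data)")
legend("topright", legend = scores, col = score_cols, pch = 19,
       title = "quality")
```

A sequential ramp suits quality because the scores are ordered, so darker points read directly as higher scores.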

White Wines

For white wines the four strongest relationships with quality are:

  • Alcohol (r 0.44)
  • Density (r -0.31)
  • Chlorides (r -0.21)
  • Total Sulfur Dioxide (r -0.17)
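
A ranking like the one above can be obtained by taking the quality column of the correlation matrix and ordering it by absolute value. The sketch below uses synthetic data with invented effect sizes, so the printed coefficients are not the ones reported here.

```r
set.seed(9)
n <- 400

# Synthetic stand-in data; quality depends on alcohol and chlorides only.
wine <- data.frame(
  alcohol   = rnorm(n, mean = 10.5, sd = 1.2),
  chlorides = rexp(n, rate = 20),
  pH        = rnorm(n, mean = 3.2, sd = 0.15)
)
wine$quality <- round(5.9 + 0.35 * (wine$alcohol - 10.5) -
                        4 * (wine$chlorides - 0.05) + rnorm(n, sd = 0.7))

# Correlation of every feature with quality, dropping quality itself.
r_quality <- cor(wine)[, "quality"]
r_quality <- r_quality[names(r_quality) != "quality"]

# Rank by strength of relationship, ignoring the sign.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
print(round(ranked, 2))
```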

The three resulting plots are as follows:

Red Wines

For red wines the four strongest relationships with quality are:

  • Alcohol (r 0.48)
  • Volatile Acidity (r -0.39)
  • Sulphates (r 0.25)
  • Citric Acid (r 0.23)

The three resulting plots are as follows:

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

By looking at the relationships between alcohol (the feature with the strongest relationship with quality) and features with the 2nd through 4th strongest relationships with quality for both white and red wines, we begin to get a sense of what influences the quality score for a given wine. The correlation matrices for each type of wine showed us that these relationships differ for white and red wines. These multivariate plots tell a subtle story. The strength of the relationship between alcohol and quality (common for all six plots) is quite clear. It is the relationship between quality and the other six features that is far more subtle.

There is a clear negative correlation between density and alcohol for white wine (r -0.8). As the alcohol content increases so does the quality which means that as the density decreases, the quality increases. This is to be expected given each feature’s correlation to each of the others.

White wines with high chloride levels tend to both be lower quality and have a lower percentage of alcohol.

White wines with high or low levels of total sulfur dioxide also tend to receive lower scores for quality. Even if this is the case, of the relationships discussed so far, this is the least clear.

The relationships for red wine are a bit easier to read as there is less data being plotted. Red wines with a high volatile acidity tend to be lower quality and contain a medium amount of alcohol at most.

Red wines that receive the highest quality score tend to have a low to medium quantity of sulphates.

Of all six plots, the plot of the relationship between citric acid and alcohol for red wines is the most spread out. Citric acid is most evenly distributed for wines with lower levels of alcohol, whereas higher alcohol wines tend to have citric acid levels below 0.125 or between 0.3 and 0.7 g / dm^3.

Were there any interesting or surprising interactions between features?

When plotting chlorides and alcohol by quality for white wines, there is a distinct spike in the variability of a wine’s chloride levels when its alcohol is between 8.5 and 10. For these wines that have elevated chlorides, their quality tends to be low or medium (between 3 and 6 out of 9).

My idea that extreme values of sulfur dioxide would result in lower quality scores is validated (to some extent) when plotting total sulfur dioxide and alcohol by quality for white wines. For this plot, the highest quality wines tend to have alcohol levels above 10% and total sulfur dioxide levels between 75 and 200 mg / dm^3.

The plots for chlorides vs alcohol for white wine and sulphates vs alcohol for red wine have a very similar shape though the shape is less clear for red wines. This is probably because there is less data for red wines. The greatest variability of sulphates in red wine appears to occur in wines with less than average alcohol content. Though the wines with large amounts of sulphate appear to be lower quality, this relationship is less clear than that of chlorides and quality for white wine.


Final Plots and Summary

Plot One

One of the most interesting concepts pursued during this exploration is the idea that white wines and red wines differ at a much more fundamental level than their colour. To begin with, the reviewers who scored the wines tend to be more critical of red wines than white wines. This plot compares the kernel density estimates for the distribution of quality scores for the two types of wine. This is an effective way of comparing these two distributions as it normalises the differences that are introduced by the disparity in size of the two data sets. Comparing two histograms of raw counts would be fruitless when there are more than twice as many white wines in the data.
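The normalising effect described above can be sketched with base R's density() function. The score vectors below are invented for illustration (they are not the real quality scores), the variable names are hypothetical, and the bandwidth of 0.5 is an arbitrary choice rather than the one used for the plot.

```r
# Invented quality scores, NOT the real data. The "white" sample is
# deliberately three times larger than the "red" sample.
white_scores <- rep(c(5, 6, 6, 7), times = 30)
red_scores   <- rep(c(5, 5, 6, 7), times = 10)

# Raw counts differ partly because the samples differ in size.
print(table(white_scores))
print(table(red_scores))

# Kernel density estimates are each scaled to integrate to roughly 1,
# so the two curves can be compared on the same axis.
white_kde <- density(white_scores, bw = 0.5)
red_kde   <- density(red_scores, bw = 0.5)

# Approximate the area under each curve with a Riemann sum over the
# evenly spaced grid that density() returns.
area <- function(kde) sum(kde$y) * diff(kde$x[1:2])
white_area <- area(white_kde)
red_area   <- area(red_kde)
print(c(white = white_area, red = red_area))
```

Both areas come out close to 1 despite the difference in sample size, which is why the curves can be overlaid and compared directly.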

From this plot, we can clearly see that white wines are more likely to receive a score of 6 or higher than reds. White wines are also more likely to receive a very high score, whereas reds are far more likely to receive a score between 5 and 7, with many more receiving 5 than 7. It is hard to see in this plot as the number of wines receiving a score of 9 is very small, but all the wines receiving a quality score of 9 are white wines.

Plot Two

TODO: Write this description

Plot Three

With this third plot, I wanted to illustrate how white wines and red wines are subtly different in their physical characteristics and how this impacts their quality. The lack of clear relationships in the data made this task very difficult. The fact that the quality of white and red wines is influenced by different physical characteristics makes direct comparison quite challenging. I chose to compare the two features that influence red and white wine quality the most but only apply to one type of wine or the other. As we discovered earlier, the quality for both types of wine is most influenced by the quantity of alcohol in them. The next most influential feature for white wine is density and for red wine, it is citric acid.

I tried to layer alcohol and quality information into the plot by using those values to control the alpha and size of the points. Though this information isn’t clear for the points in the central cluster, it is possible to discern these details for points at the periphery, so I chose to keep the information as a part of the plot.

Apart from the general trend of white wine having a larger range of density than red wine and white wine having a larger cluster of data with similar quantities of citric acid, a very striking feature of this plot is the disproportionate amount of data with ‘round’ values for citric acid. There are very clear bands of wine with citric acid values of 0, 0.5 and 0.75 g/dm^3. I suspect there is a similar band at 0.25 g/dm^3 though it is hard to be sure as the band appears to be obscured by the cluster of data with this level of citric acid.
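One way to check for bands like these without relying on the plot is to count how often each exact value occurs. The following is a minimal sketch using base R's table(); the vector citric_toy and its values are invented for illustration and are not taken from the wine data.

```r
# Invented citric acid values (g/dm^3), NOT the real data.
citric_toy <- c(0, 0, 0, 0.12, 0.26, 0.31,
                0.5, 0.5, 0.5, 0.5, 0.75, 0.75)

# Count the occurrences of each distinct value, most frequent first.
value_counts <- sort(table(citric_toy), decreasing = TRUE)
print(head(value_counts, 3))
```

Applied to the real citric acid column, a count like this would show whether a value such as 0.25 is over-represented even when its band is hidden by the central cluster in the plot.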


Reflection

This data appears to validate the notion that wine flavours can be very subtle and it takes a discerning palate to differentiate between the traits of different wines. Though there are relationships between the different physical characteristics of a wine and its quality, these relationships are quite weak. Even alcohol, with the strongest relationship with quality, still has a Pearson product-moment correlation coefficient (r) of less than 0.5 for both white and red wines.

These subtle relationships made this exploration quite challenging. Without clear relationships to latch on to, the exploration proved quite nebulous and vague. The clearest finding from the exploration is that the difference between white and red wines isn’t merely their colour. The characteristics that help differentiate a low-quality wine from a high-quality one are different for red and white wines, though these relationships are weak and might be ignored if the data contained more striking correlations.

Rendering the joint correlations plot was a struggle and took quite a lot of tweaking. I’m not entirely happy with it. Ideally, I would like the classification of the two triangles (red and white) to be clear visually (rather than requiring a textual description) but was unable to render the distinction as I had wanted. For some reason, lines I was drawing over the top of the plot were getting clipped and not displaying halfway down the last row of the plot. I felt like my attempts looked messy, so I decided not to pursue it further.

Rendering the correlation matrices was the most illuminating part of the exploration process. Seeing the differences between the correlation values for white and red wines was the first place I saw evidence that the physical characteristics of the wines not only differ between red and white wines, but also have differing impacts on the perceived quality of the two types of wine.

I would be interested in performing further analysis with wine data that included other data that would likely influence perceived quality such as perceived flavour (sweet, dry, etc.), colour, context, shape of the glass, or whether the wine was perceived to be cheap or expensive. It would also be interesting to know if individual reviewers reviewed multiple wines and whether there were trends in their reviews. For example, were individual reviewers more or less sensitive to certain physical features of the wine?

Resources